Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve benchmark scripts via trials #1908

Merged
merged 1 commit into from
Jul 30, 2024
Merged

Conversation

charleskawczynski
Copy link
Member

@charleskawczynski charleskawczynski commented Jul 30, 2024

It turns out that running multiple trials, like BenchmarkTools.@btime is important to make a fair comparison and reduce the noise. We could use BenchmarkTime's @btime in the outer part and CUDA.@sync with an nreps loop for the inner part, but then we'd also need to interpolate a bunch, so I just went with writing simple loops instead.

Here is a summary of the updated numbers:

Indexing & static ndranges

[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), float_type = Float32, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                                       │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│ BSR.at_dot_call!(X_array, Y_array; nreps=1000, bm)                          │ 68 microseconds, 641 nanoseconds │ 14.4882295.41511000   │
│ BSR.at_dot_call!(X_vector, Y_vector; nreps=1000, bm)                        │ 13 microseconds, 787 nanoseconds │ 72.13661470.8611000   │
│ iscpu || BSR.custom_sol_kernel!(X_vector, Y_vector, Val(N); nreps=1000, bm) │ 12 microseconds, 925 nanoseconds │ 76.9431568.8711000   │
│ BSR.custom_kernel_bc!(X_vector, Y_vector, us; nreps=1000, bm)               │ 13 microseconds, 364 nanoseconds │ 74.41951517.4111000   │
│ BSR.custom_kernel_bc!(X_vector, Y_vector, uss; nreps=1000, bm)              │ 12 microseconds, 929 nanoseconds │ 76.92471568.4911000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, us; use_pw=false, nreps=1000, bm)   │ 41 microseconds, 5 nanoseconds   │ 24.2533494.52511000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, uss; use_pw=false, nreps=1000, bm)  │ 26 microseconds, 652 nanoseconds │ 37.3141760.83511000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, us; use_pw=true, nreps=1000, bm)    │ 13 microseconds, 582 nanoseconds │ 73.22431493.0411000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, uss; use_pw=true, nreps=1000, bm)   │ 12 microseconds, 922 nanoseconds │ 76.96131569.2411000   │
└─────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), float_type = Float64, device_bandwidth_GBs=2039
┌─────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                                       │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├─────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│ BSR.at_dot_call!(X_array, Y_array; nreps=1000, bm)                          │ 69 microseconds, 10 nanoseconds  │ 28.8217587.67311000   │
│ BSR.at_dot_call!(X_vector, Y_vector; nreps=1000, bm)                        │ 28 microseconds, 219 nanoseconds │ 70.48481437.1811000   │
│ iscpu || BSR.custom_sol_kernel!(X_vector, Y_vector, Val(N); nreps=1000, bm) │ 25 microseconds, 460 nanoseconds │ 78.12211592.9111000   │
│ BSR.custom_kernel_bc!(X_vector, Y_vector, us; nreps=1000, bm)               │ 25 microseconds, 625 nanoseconds │ 77.61941582.6611000   │
│ BSR.custom_kernel_bc!(X_vector, Y_vector, uss; nreps=1000, bm)              │ 25 microseconds, 436 nanoseconds │ 78.19751594.4511000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, us; use_pw=false, nreps=1000, bm)   │ 41 microseconds, 621 nanoseconds │ 47.7881974.411000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, uss; use_pw=false, nreps=1000, bm)  │ 27 microseconds, 111 nanoseconds │ 73.36541495.9211000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, us; use_pw=true, nreps=1000, bm)    │ 25 microseconds, 931 nanoseconds │ 76.7031563.9711000   │
│ BSR.custom_kernel_bc!(X_array, Y_array, uss; use_pw=true, nreps=1000, bm)   │ 25 microseconds, 464 nanoseconds │ 78.10951592.6511000   │
└─────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

Index swapping

[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), float_type = Float32, device_bandwidth_GBs=2039
┌──────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                                │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├──────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│ BIS.at_dot_call!(X_vector, Y_vector; nreps=1000, bm)                 │ 34 microseconds, 617 nanoseconds │ 57.45741171.5621000   │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=0, nreps=1000, bm) │ 60 microseconds, 384 nanoseconds │ 32.939671.62721000   │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=1, nreps=1000, bm) │ 68 microseconds, 108 nanoseconds │ 29.2034595.45821000   │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=2, nreps=1000, bm) │ 60 microseconds, 395 nanoseconds │ 32.9329671.50221000   │
└──────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), float_type = Float64, device_bandwidth_GBs=2039
┌──────────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                                                │ time per call                    │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├──────────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│ BIS.at_dot_call!(X_vector, Y_vector; nreps=1000, bm)                 │ 59 microseconds, 558 nanoseconds │ 66.7911361.8721000   │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=0, nreps=1000, bm) │ 63 microseconds, 238 nanoseconds │ 62.9051282.6321000   │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=1, nreps=1000, bm) │ 80 microseconds, 502 nanoseconds │ 49.41421007.5621000   │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=2, nreps=1000, bm) │ 63 microseconds, 228 nanoseconds │ 62.91421282.8221000   │
└──────────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

Summary

  • Indexing & static ndranges:

    • The very slow (array + dynamic size) kernel is (strangely) about the same between Float32 and Float64
    • The SOL kernel (vector + static size) kernel is (expectedly) 2x faster for Float32 than Float64
    • The array + fully static size is 2x slower than SOL for Float32 and only 8% slower than SOL for Float64
    • The array + ("forced") linear indexing is equivalent to SOL for Float32 and Float64
  • Index swapping (all static sizes):

    • The SOL kernel (vector, no swap) kernel is (expectedly) 2x faster for Float32 than Float64.
    • The main take away: index swapping has a much greater performance degradation for Float32 than it does for Float64.

@charleskawczynski charleskawczynski merged commit e2d61b0 into main Jul 30, 2024
16 of 19 checks passed
@charleskawczynski charleskawczynski deleted the ck/upate_benchmarks branch July 30, 2024 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant